Automatically Creating Bilingual Lexicons for Machine 
Translation from Bilingual Text 
Abstract

A method is presented for automatically aug- 
menting the bilingual lexicon of an existing Ma- 
chine Translation system, by extracting bilin- 
gual entries from aligned bilingual text. The 
proposed method only relies on the resources 
already available in the MT system itself. It is 
based on the use of bilingual lexical templates 
to match the terminal symbols in the parses of 
the aligned sentences. 



1 Introduction



A novel approach to automatically building 
bilingual lexicons is presented here. The term 
bilingual lexicon denotes a collection of complex 
equivalences as used in Machine Translation 
(MT) transfer lexicons, not just word equiva- 
lences. In addition to words, such lexicons in- 
volve syntactic and semantic descriptions and 
means to perform a correct transfer between the 
two sides of a bilingual lexical entry. 

A symbolic, rule-based approach of the parse- 
parse-match kind is proposed. The core idea 
is to use the resources of bidirectional transfer 
MT systems for this purpose, taking advantage 
of their features to convert them to a novel use. 



In addition to having them use their bilingual
lexicons to produce translations, it is proposed
to have them use translations to produce bilingual
lexicons. Although other uses might be
conceived, the most appropriate use is to have
an MT system automatically augment its own
bilingual lexicon from a small initial sample.

The core of the described approach consists
of using a set of bilingual lexical templates in
matching the parses of two aligned sentences
and in turning the lexical equivalences thus
established into new bilingual lexical entries.

2 Theoretical framework

The basic requirement that an MT system
should meet for the present purpose is to be
bidirectional. Bidirectionality is required in order
to ensure that both source and target grammars
can be used for parsing and that transfer
can be done in both directions. More precisely,
what is relevant is that the input and output to
transfer be the same kind of structure.

Moreover, the proposed method is most productive
with a lexicalist MT system (Whitelock, 1994).
The proposed application is concerned
with producing bilingual lexical knowledge,
and this sort of knowledge is the only type
of bilingual knowledge required by lexicalist systems.
Nevertheless, it is also conceivable that
the present approach can be used with a non-lexicalist
transfer system, as long as the system
is bidirectional. In this case, only the lexical
portion of the bilingual knowledge can be
automatically produced, assuming that the structural
transfer portion is already in place. In
the rest of this paper, a lexicalist MT system
will be assumed and referred to. For the specific
implementation described here and all the
examples, we will refer to an existing lexicalist
English-Spanish MT system (Popowich et al.,
1997).



The main feature of a lexicalist MT system is 
that it performs no structural transfer. Transfer 
is a mapping between a bag of lexical items used 
in parsing (the source bag) and a corresponding 
bag of target lexical items (the target bag), to 
be used in generation. The source bag actu- 
ally contains more information than the corre- 
sponding bag of lexical items before parsing. Its 
elements get enriched with additional information
instantiated during the parsing process.
Information of fundamental importance included
therein is a system of indices that express
dependencies among lexical items. Such dependencies
are transferred to the target bag and
used to constrain generation. The task of generation
is to find an order in which the lexical
items can be successfully parsed.

3 Bilingual templates 

A bilingual template is a bilingual entry in which 
words are left unspecified. E.g.: 

(1) _ :: (L,@count_noun(A)) <->
    _ :: (R,@noun(A))
    \\trans_noun(L,R).



Here, a '::' operator connects a word (a variable,
in a template) to a description, a '<->' operator
connects the left and right sides of the entry, and '\\'
introduces a transfer macro, which takes two
descriptions as arguments and performs some
additional transfer (Turcato et al., 1997).
Descriptions are mainly expressed by macros,
introduced by a '@' operator. The macro arguments
are indices, as used in lexicalist transfer.

Templates have been widely used in MT
(Buschbeck-Wolf and Dorna, 1997), particularly
in the Example-Based Machine Translation
(EBMT) framework (Kaji et al. (1992),
Güvenir and Tunç (1996)). However, in
EBMT, templates are most often used to model
sentence-level correspondences, rather than lexical
equivalences. Consequently, in EBMT the
relation between lexical equivalences and tem- 
plates is the reverse of what is being proposed 
here. In EBMT, lexical equivalences are as- 
sumed and (sentential) templates are inferred 
from them. In the present framework, sentential 
correspondences (in the form of possible combi- 
nations of lexical templates) are assumed and 
lexical equivalences are inferred from them. 

In a lexicalist approach, the notion of bilingual
lexical entry, and thus that of bilingual
template, must be understood broadly. Multiword
entries can exist. They can express dependen- 
cies among lexical items, thus being suitable for 
expressing phrasal equivalences. In brief, bilin- 
gual lexical entries can exhaustively cover all the 
bilingual information needed in transfer. 

In a lexicalist MT system, transfer is accom- 
plished by finding a bag of bilingual entries par- 
titioning the source bag. The source side of each 
entry (in the rest of this paper: the left hand 
side) corresponds to a cell of the partition. The
union of the target sides of the entries constitutes
the target bag. E.g.:

(2) a. Source bag:
       {Sw1::Sd1, Sw2::Sd2, Sw3::Sd3}

    b. Bilingual entries:
       {Sw1::Sd1 & Sw3::Sd3 <->
        Tw1::Td1 & Tw2::Td2,
        Sw2::Sd2 <->
        Tw3::Td3 & Tw4::Td4}

    c. Target bag:
       {Tw1::Td1, Tw2::Td2, Tw3::Td3, Tw4::Td4}

where each Swi::Sdi and Twi::Tdi is, respectively,
a source and a target <Word,Description>
pair. In addition, the bilingual entries must satisfy
the constraints expressed by indices in the
source and target bags. The same information
can be used to find (2b), given (2a) and (2c).
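
The partition view of transfer in (2) can be illustrated with a
minimal Prolog sketch. It is not the system's actual transfer
component: the entry/2 representation, the Word-Description pairs
and the predicate names are assumptions made for illustration, the
atoms sw1, sd1, etc. are placeholders, and the index constraints
are ignored. SWI-Prolog is assumed for the library predicates.

```prolog
:- use_module(library(lists)).

% Hypothetical toy lexicon mirroring (2b): entry(SourceItems, TargetItems).
% Items are Word-Description pairs; sw1, sd1, ... are placeholder atoms.
entry([sw1-sd1, sw3-sd3], [tw1-td1, tw2-td2]).
entry([sw2-sd2],          [tw3-td3, tw4-td4]).

% transfer(+SourceBag, -TargetBag, -Entries)
% Entries is a list of entries whose source sides partition SourceBag;
% TargetBag collects their target sides. Index constraints are ignored.
transfer([], [], []).
transfer([Item|Rest], TargetBag, [entry(Src, Tgt)|Entries]) :-
    entry(Src, Tgt),
    select(Item, Src, SrcRest),         % the entry must cover Item
    remove_all(SrcRest, Rest, Rest1),   % and consume the rest of its source side
    transfer(Rest1, TargetRest, Entries),
    append(Tgt, TargetRest, TargetBag).

% remove_all(+Items, +Bag, -Remainder): remove each item once from Bag.
remove_all([], Bag, Bag).
remove_all([X|Xs], Bag, Rest) :-
    select(X, Bag, Bag1),
    remove_all(Xs, Bag1, Rest).
```

With this toy lexicon, the query
transfer([sw1-sd1, sw2-sd2, sw3-sd3], T, Es) yields the target bag
of (2c) in T and the two entries of (2b) in Es.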

Any bilingual lexicon is partitioned by a set of 
templates. The entries in each equivalence class 
differ only in their words. A bilingual lexical entry
can thus be viewed as a triple <Sw,Tw,T>,
where Sw is a list of source words, Tw a list of 
target words, and T a template. A set of such 
bilingual templates can be intuitively regarded 
as a 'transfer grammar'. A grammar defines all 
the possible sequences of pre-terminal symbols, 
i.e. all the possible types of sentences. Anal- 
ogously, a set of bilingual templates defines all 
the possible translational equivalences between 
bags of pre-terminal symbols, i.e. all the possi- 
ble equivalences between types of sentences. 

Using this intuition, the possibility is ex- 
plored of analyzing a pair of such bags by means 
of a database of bilingual templates, to find a 
bag of templates that correctly accounts for the 
translational equivalence of the two bags, with- 
out resorting to any information about words. 
In the example (2), the following bag of templates
would be the requested solution:

(3) {_::Sd1 & _::Sd3 <-> _::Td1 & _::Td2,
     _::Sd2 <-> _::Td3 & _::Td4}

Equivalences between (bags of) words are automatically
obtained as a result of the process,
whereas in translating they are assumed and
used to select the appropriate bilingual entries.

97.5 %    922

Table 1: Incremental template coverage

The whole idea is based on the assumption 
that a lexical item's description and the con- 
straints on its indices are sufficient in most cases 
to uniquely identify a lexical item in a parse out- 
put bag. Although exceptions could be found 
(most notably, two modifiers of the same cate- 
gory modifying the same head), the idea is vi- 
able enough to be worth exploring. 

The impression might arise that it is difficult 
and impractical to have a set of templates avail- 
able in advance. However, there is empirical ev- 
idence to the contrary. A count on the MT sys- 
tem used here showed that a restricted number 
of templates covers a large portion of a bilingual 
lexicon. Table 1 shows the incremental coverage.
Although completeness is hard to obtain,
a satisfactory coverage can be achieved with a 
relatively small number of templates. 

In the implementation described here, a set of 
templates was extracted from the MT bilingual 
lexicon and used to bootstrap further lexical 
development. The whole lexical development 
can be seen as an interactive process involv- 
ing a bilingual lexicon and a template database. 
Templates are initially derived from the lexicon;
new entries are then created using
the templates. Iteratively, new entries can be
manually coded when the automatic procedure 
is lacking appropriate templates and new tem- 
plates extracted from the manually coded en- 
tries can be added to the template database. 
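
The extraction of templates from an existing lexicon can be
sketched as follows. This is an illustration under stated
assumptions, not the procedure actually used: entries are
represented as hypothetical entry(SourceSide, TargetSide) terms
whose sides are lists of Word-Description pairs, transfer macros are
left out, and SWI-Prolog's =@= (equality up to variable renaming) is
used to keep one template per equivalence class.

```prolog
:- use_module(library(apply)).

% abstract_entry(+Entry, -Template): replace every word by a fresh variable.
abstract_entry(entry(Src, Tgt), template(SrcT, TgtT)) :-
    maplist(drop_word, Src, SrcT),
    maplist(drop_word, Tgt, TgtT).

% Keep the description, forget the word.
drop_word(_Word-Desc, _Var-Desc).

% extract_templates(+Entries, -Templates)
% Templates holds one template per class of entries that differ
% only in their words.
extract_templates(Entries, Templates) :-
    maplist(abstract_entry, Entries, All),
    dedup(All, Templates).

% dedup(+Terms, -Unique): drop terms that are variants of an earlier one.
dedup([], []).
dedup([T|Ts], [T|Us]) :-
    exclude(variant_of(T), Ts, Rest),
    dedup(Rest, Us).

variant_of(T, U) :- T =@= U.
```

For instance, the two adjective entries for fat/gordo and
black/negro collapse into a single adjective template, while a noun
entry such as man/hombre contributes a second one.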

4 The algorithm 

In this section the algorithm for creating bilin- 
gual lexical entries is described, along with a 
sample run. The procedure was implemented 
in Prolog, as was the MT system at hand. Basically,
a set of lexical entries is obtained from a
pair of sentences by first parsing the source and
target sentences. The source bag is then trans- 
ferred using templates as transfer rules (plus en- 
tries for closed-class words and possibly a pre- 
existing bilingual lexicon). The transfer out- 
put bag is then unified with the target sentence 
parse output bag. If the unification succeeds, 
the relevant information (bilingual templates 
and associated words) is retrieved to build up 
the new bilingual entries. Otherwise, the sys- 
tem backtracks into new parses and transfers. 

The main predicate make_entries/3 matches 
a source and a target sentence to produce a set 
of bilingual entries: 

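make_entries(Source, Target, Entries) :-
    parse_source(Source, Deriv1),          % parse the source sentence
    parse_target(Target, Deriv2),          % parse the target sentence
    transfer(Deriv1, Deriv3),              % transfer using templates
    get_bag(Deriv2, Bag2),                 % target parse output bag
    get_bag(Deriv3, Bag3),                 % transfer output bag
    match_bags(Bag2, Bag3, Bag4),          % unify the two bags
    get_bag(Deriv1, Bag1),                 % source parse output bag
    make_be_info(Bag1, Bag4, Deriv3, Be),  % collect be/3 terms
    be_info_to_entries(Be, Entries).       % build the bilingual entries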

Each Derivn variable points to a buffer where 
all the information about a specific derivation 
(parse or transfer) is stored and each Bagn vari- 
able refers to a bag of lexical items. Each step 
will be discussed in detail in the rest of the sec- 
tion. A sample run will be shown for the fol- 
lowing English-Spanish pair of sentences: 

(4) a. the fat man kicked out the black dog.

    b. el hombre gordo echó el perro negro.

In the sample session no bilingual lexicon was 
used for content words. Only a bilingual lexi- 
con for closed class words and a set of bilingual 
templates were used. Therefore, new bilingual 
entries were obtained for all the content words 
(or phrases) in the sentences. 

4.1 Source sentence parse 

The parse of the source sentence is performed 
by parse_source/2. The parse tree is shown in 
Fig. 1. Since only lexical items are relevant for
the present purposes, only pre-terminal nodes 
in the tree are labeled. 

Figure 1: Source sentence parse tree.

Figure 2: Source sentence parse output bag.

Fig. 2 shows, in succinct form, the relevant
information from the source bag, i.e. the bag
resulting from parsing the source sentence. All
the syntactic and semantic information has been
omitted and replaced by a category label. What
is relevant here is the way the indices are set, as
a result of parsing. The words {the, fat, man}
are tied together and so are {kick, out} and
{the, black, dog}. Moreover, the indices of
'kick' show that its second index is tied to its
subject, {the, fat, man}, and its third index is
tied to its object, {the, black, dog}.

4.2 Target sentence parse 

The parse of the target sentence is performed
by parse_target/2. Figs. 3 and 4 show,
respectively, the resulting tree and bag. In
an analogous manner to what is seen in
the source sentence, {el, hombre, gordo} and
{el, perro, negro} are, respectively, the subject
and the object of 'echó'.

Figure 3: Target sentence parse tree.

Figure 4: Target sentence parse output bag.

4.3 Transfer

The result of parsing the source sentence is used
by transfer/2 to create a translationally equivalent
target bag. Fig. 5 shows the result. Transfer
is performed by consulting a bilingual lexicon,
which, in the present case, contained entries
for closed class words (e.g. an entry mapping
'the' to 'el') and templates for content
words. The templates relevant to our example
are the following:

(5) a. _ :: @adj(A)
       <-> 'word(adj/adj,1)' :: @adj(A).

    b. _ :: (L,@count_noun(A))
       <-> 'word(cn/n,1)' :: (R,@noun(A))
       \\trans_noun(L,R).

    c. _ :: (L,@trans_verb(A,B,C))
       & _ :: @advparticle(A)
       <->
       'word(tv+adv/tv,1)' ::
       (R,@verb_acc(A,B,C))
       \\trans_verb(L,R).



Id   Word               Cat  Indices
2-1  el                 d    [A]
3-2  word(adj/adj,1)    adj  [A]
4-3  word(cn/n,1)       n    [A]
1-4  word(tv+adv/tv,1)  v    [B,A,I]
5-6  el                 d    [I]
6-7  word(adj/adj,1)    adj  [I]
7-8  word(cn/n,1)       n    [I]

Figure 5: Transfer output bag.



Bilingual templates are simply bilingual en- 
tries with words replaced by variables. Actually, 
on the target side, words are replaced by labels 
of the form word(Ti,Position), where Ti is a
template identifier and Position identifies the
position of the item in the right hand side of the
template. Thus, a label word(adj/adj,1) identifies
the first word on the right hand side of the
template that maps an adjective to an adjective. 
Such labels are just implementational technical- 
ities that facilitate the retrieval of the relevant 
information when a lexical entry is built up from 
a template, but they have no role in the match- 
ing procedure. For the present purposes they 
can entirely be regarded as anonymous variables 
that can unify with anything, exactly like their 
source counterparts. 

After transfer, the instances of the templates 
used in the process are coindexed in some way, 
by virtue of their unification with the source bag 
items. This is analogous to what happens with 
bilingual entries in the translation process. 

4.4 Target bag matching 

The predicate get_bag/2 retrieves a bag of lex- 
ical items associated with a derivation. There- 
fore, Bag2 and Bag3 will contain the bags of 
lexical items resulting, respectively, from pars- 
ing the target sentence and from transfer. 

The crucial step is the matching between the 
transfer output bag and the target sentence 
parse output bag. The predicate match_bags/3 
tries to unify the two bags (returning the result 
in Bag4). A successful unification entails that 
the parse and transfer of the source sentence 
are consistent with the parse of the target sen- 
tence. In other words, the bilingual rules used 
in transfer correctly map source lexical items 
into target lexical items. Therefore, the lexi- 
cal equivalences newly established through this 
process can be asserted as new bilingual entries. 

In the matching process, the order in which 
the elements are listed in the figures is irrele- 
vant, since the objects at hand are bags, i.e. 
unordered collections. A successful match only 
requires the existence of a one-to-one mapping 
between the two bags, such that: 

(i) the respective descriptions, here repre- 
sented by category labels, are unifiable; 

(ii) a further one-to-one mapping between the 
indices in the two bags is induced. 



The following mapping between the transfer
output bag (Fig. 5) and the target sentence
parse output bag (Fig. 4) will therefore succeed:

{<2-1,1>,<3-2,3>,<4-3,2>,<1-4,4>,
 <5-6,5>,<6-7,7>,<7-8,6>}

In fact, in addition to correctly unifying the
descriptions, it induces the following one-to-one
mapping between the two sets of indices:

{<A,0>,<B,1>,<I,13>}
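
The two matching conditions can be made concrete with a small
Prolog sketch. It is an illustration only, not the actual
match_bags/3: the item(Word, Category, Indices) representation and
the helper predicates are assumptions, descriptions are reduced to
category labels, and the target parse bag below is reconstructed
from the two mappings above rather than copied from the system
output. SWI-Prolog is assumed.

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).

% match_bags(+ParseBag, +TransferBag, -Matched)
% Matched is TransferBag after unification with the items of ParseBag.
match_bags(ParseBag, TransferBag, TransferBag) :-
    index_vars(TransferBag, Vars),      % index variables before matching
    match_items(TransferBag, ParseBag), % condition (i), order-independent
    all_distinct(Vars).                 % condition (ii): one-to-one indices

% Pair every transfer item with a different parse item by unification.
match_items([], []).
match_items([Item|Items], ParseBag) :-
    select(Item, ParseBag, Rest),
    match_items(Items, Rest).

index_vars(Bag, Vars) :-
    maplist(item_indices, Bag, Lists),
    term_variables(Lists, Vars).

item_indices(item(_, _, Indices), Indices).

all_distinct(Xs) :-
    sort(Xs, Sorted),                   % sort/2 drops duplicates
    length(Xs, N),
    length(Sorted, N).

% Transfer output bag as in Fig. 5; words are left unbound.
transfer_bag([item(_, d, [A]), item(_, adj, [A]), item(_, n, [A]),
              item(_, v, [_B, A, I]),
              item(_, d, [I]), item(_, adj, [I]), item(_, n, [I])]).

% Target parse bag, reconstructed from the mappings (an assumption).
parse_bag([item(el, d, [0]), item(hombre, n, [0]), item(gordo, adj, [0]),
           item(echar, v, [1, 0, 13]),
           item(el, d, [13]), item(perro, n, [13]), item(negro, adj, [13])]).
```

The query parse_bag(P), transfer_bag(T), match_bags(P, T, M) binds
the words of the transfer items to el, gordo, hombre, echar, el,
negro and perro, in that order. The alternative pairing of the first
determiner with the object noun phrase is rejected because the
indices of the verb then fail to unify.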

4.5 Bilingual entries creation 

The rest of the procedure builds up lexical en- 
tries for the newly discovered equivalences and 
is implementation dependent. First, the source 
bag is retrieved in Bag1. Then, make_be_info/4
links together information from the source bag,
the target bag (actually, its unification with
the target sentence parse bag) and the transfer
derivation, to construct a list of terms (the
variable Be) containing the information to create
an entry. Each such term has the form
be(Sw,Tw,Ti), where Sw is a list of source
words, Tw is a list of target words and Ti is 
a template identifier. In our example, the fol- 
lowing be/3 terms are created: 

(6) a. be([fat],[gordo],adj/adj)

    b. be([man],[hombre],cn/n)

    c. be([kick,out],[echar],tv+adv/tv)

    d. be([black],[negro],adj/adj)

    e. be([dog],[perro],cn/n)

Each be/3 term is finally turned
into a bilingual entry by the predicate
be_info_to_entries/2. The following bilingual
entries are created:

(7) a. fat :: @adj(A)
       <-> gordo :: @adj(A).

    b. man :: (D,@count_noun(C))
       <-> hombre :: (B,@noun(C))
       \\trans_noun(D,B).

    c. kick :: (I,@trans_verb(F,G,H))
       & out :: @advparticle(F)
       <->
       echar :: (E,@verb_acc(F,G,H))
       \\trans_verb(I,E).

    d. black :: @adj(J)
       <-> negro :: @adj(J).

    e. dog :: (M,@count_noun(L))
       <-> perro :: (K,@noun(L))
       \\trans_noun(M,K).
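
The step from be/3 terms to entries amounts to instantiating the
word variables of the template named by the identifier. A minimal
sketch follows; the template/4 table, the entry/2 terms with
Word-Description pairs, and the predicate names be_to_entry/2 and
bes_to_entries/2 are hypothetical, and the transfer macros of (5)
are omitted for brevity.

```prolog
:- use_module(library(apply)).

% template(Id, SourceWords, TargetWords, Entry): hypothetical template
% table modelled on (5); the word positions are variables shared with Entry.
template(adj/adj, [S], [T],
         entry([S-adj(A)], [T-adj(A)])).
template(cn/n, [S], [T],
         entry([S-count_noun(A)], [T-noun(A)])).
template(tv+adv/tv, [S1, S2], [T],
         entry([S1-trans_verb(A, B, C), S2-advparticle(A)],
               [T-verb_acc(A, B, C)])).

% be_to_entry(+Be, -Entry): look up the template and plug in the words.
% Each call gets a fresh copy of the template, so entries share no indices.
be_to_entry(be(Sw, Tw, Ti), Entry) :-
    template(Ti, Sw, Tw, Entry).

% bes_to_entries(+Bes, -Entries)
bes_to_entries(Bes, Entries) :-
    maplist(be_to_entry, Bes, Entries).
```

For example, be([kick,out],[echar],tv+adv/tv) is turned into an
entry pairing kick and out on the source side with echar on the
target side, with the indices of the verb and the particle shared
as in (7c).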

If a pre-existing bilingual lexicon is in use, 
bilingual entries are prioritized over bilingual 
templates. Consequently, only new entries are 
created, the others being retrieved from the ex- 
isting bilingual lexicon. Incidentally, it should 
be noted that a new entry is an entry which 
differs from any existing entry on either side. 
Therefore, different entries are created for dif- 
ferent senses of the same word, as long as the 
different senses have different translations. 
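
The novelty criterion can be sketched as a simple filter over be/3
terms. The sketch assumes, for illustration only, that the existing
lexicon is available as a list of be/3 terms and that the two sides
of an entry are compared by their word lists alone; the example
words are hypothetical.

```prolog
:- use_module(library(apply)).
:- use_module(library(lists)).

% is_new(+Lexicon, +Candidate): succeeds if no existing entry has both
% the same source words and the same target words as Candidate.
is_new(Lexicon, be(Sw, Tw, _)) :-
    \+ memberchk(be(Sw, Tw, _), Lexicon).

% new_entries(+Candidates, +Lexicon, -New): keep only the new candidates.
new_entries(Candidates, Lexicon, New) :-
    include(is_new(Lexicon), Candidates, New).
```

Given a lexicon containing only be([bank],[banco],cn/n), a candidate
be([bank],[orilla],cn/n) is kept as a new entry for a different
sense, whereas a second be([bank],[banco],cn/n) is discarded.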

5 Shortcomings and future work 

In matching a pair of bags, two kinds of ambigu- 
ity could lead to multiple results, some of which 
are incorrect. Firstly, as already mentioned, a 
bag could contain two lexical items with unifi- 
able descriptions (e.g. two adjectives modify- 
ing the same noun), possibly causing an incor- 
rect match. Secondly, as the bilingual template 
database grows, the chance of overlaps between 
templates also grows. Two different templates 
or combinations of templates might cover the 
same input and output. A case in point is that 
of a phrasal verb or an idiom covered by both a 
single multi-word template and a compositional 
combination of simpler templates. 

As both potential sources of error can be au- 
tomatically detected, a first step in tackling the 
problem would be to block the automatic gener- 
ation of the entries involved when a problematic 
case occurs, or to have a user select the correct 
candidate. In this way the correctness of the 
output is guaranteed. The possible cost is a 
lack of completeness, when no user intervention 
is foreseen. 

Furthermore, techniques for the automatic 
resolution of template overlaps are under inves- 
tigation. Such techniques assume the presence 
of a bilingual lexicon. The information con- 
tained therein is used to assign preferences to 
competing candidate entries, in two ways. 

Firstly, templates are probabilistically
ranked, using the existing bilingual lexicon
to estimate probabilities. When the choice
is between single entries, the ranking can be
performed by counting the frequency of each
competing template in the lexicon. The entry
with the most frequent template is chosen.
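
A frequency-based ranking of this kind could look as follows. The
sketch is only one possible reading of the technique, which is
described as still under investigation: the lex/3 facts, the
template identifier mn/n (a hypothetical mass noun template) and the
predicate names are assumptions, and raw counts stand in for
estimated probabilities.

```prolog
:- use_module(library(lists)).

% Hypothetical existing lexicon: lex(SourceWords, TargetWords, TemplateId).
lex([man],   [hombre], cn/n).
lex([dog],   [perro],  cn/n).
lex([water], [agua],   mn/n).

% template_count(+TemplateId, -N): number of lexicon entries using it.
template_count(Ti, N) :-
    findall(x, lex(_, _, Ti), Xs),
    length(Xs, N).

% best_candidate(+Candidates, -Best)
% Best is the be/3 term whose template is most frequent in the lexicon.
% Fails on an empty candidate list; ties go to the later candidate.
best_candidate(Candidates, Best) :-
    findall(N-C,
            ( member(C, Candidates),
              C = be(_, _, Ti),
              template_count(Ti, N) ),
            Scored),
    keysort(Scored, Sorted),            % ascending by count
    last(Sorted, _-Best).
```

With the three lex/3 facts above, the candidates
be([house],[casa],mn/n) and be([house],[casa],cn/n) are ranked in
favour of the count noun template, which occurs twice in the
lexicon.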

Secondly, heuristics are used to assign pref- 
erences, based on the presence of pre-existing 
entries related in some way to the candidate 
entries. This technique is suited for resolv- 
ing ambiguities where multiple entries are in- 
volved. For instance, given the equivalence 
between 'kick the bucket' and 'estirar la 
pata', and the competing candidates 

(8) a. {kick & bucket <-> estirar & pata} 
b. {kick <-> estirar, bucket <-> pata} 

the presence of an entry 'bucket <-> balde' in 
the bilingual lexicon might be a clue for prefer- 
ring the idiomatic interpretation. Conversely, if 
the hypothetical entry 'bucket <-> pata' were 
already in the lexicon, the compositional inter- 
pretation might be preferred. 

Finally, efficiency is also dependent on the restrictiveness
of grammars. The more grammars
overgenerate, the more the combinatoric inde- 
terminacy in the matching process increases. 
However, overgeneration is as much a problem 
for translation as for bilingual generation. In 
other words, no additional requirement is placed 
on the MT system which is not independently 
motivated by translation alone. 

6 Conclusion 

The parse-parse-match approach to automatically
building bilingual lexicons is not novel.
Proposals have been put forward, e.g., by Sadler
and Vendelmans (1990) and Kaji et al. (1992).

Wu (1995) points out some possible difficulties
of the parse-parse-match approach. Among
them, the facts that "appropriate, robust,
monolingual grammars may not be available"
and "the grammars may be incompatible across
languages" (Wu, 1995, 355). More generally,
in bilingual lexicon development there is a tendency
to minimize the need for linguistic resources
specifically developed for the purpose.
In this view, several proposals tend to use statistical,
knowledge-free methods, possibly in combination
with the use of existing Machine Readable
Dictionaries (see, e.g., Klavans and Tzoukermann
(1995), which also contains a survey of
related proposals, pages 195-196).



The present proposal tackles the problem 
from a different and novel perspective. The ac- 
knowledgment that MT is the main application 
domain to which bilingual resources are relevant 
is taken as a starting point. The existence of an 
MT system, for which the bilingual lexicon is 
intended, is explicitly assumed. The potential 
problems due to the need for linguistic resources 
are by-passed by having the necessary resources 
available in the MT system. Rather than doing 
away with linguistic knowledge, the pre-existing 
resources of the pursued application are utilized. 

An approach like the present one can be most effectively
adopted to develop tools allowing MT
systems to automatically build their own bilin- 
gual lexicons. A tool of this sort would use 
no extra resources in addition to those already 
available in the MT system itself. Such a tool 
would take a small sample of a bilingual lexicon 
and use it to bootstrap the automatic devel- 
opment of a large lexicon. It is worth noting 
that the bilingual pairs thus produced would be 
complete bilingual entries that could be directly 
incorporated in the MT system, with no post- 
editing or addition of information. 

The only requirement placed by the present 
approach on MT systems is that they be bi- 
directional. Therefore, although aimed at the 
development of specific applications for specific 
MT systems, the approach is general enough to 
apply to a wide range of MT systems. 